AiluRus: A Scalable ViT Framework for Dense Prediction
Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance. However, their complexity increases dramatically when handling long token sequences, particularly in dense prediction tasks that require high-resolution input. Notably, dense prediction tasks such as semantic segmentation and object detection depend more on the contours and shapes of objects, while the texture inside objects is less informative. Motivated by this observation, we propose applying adaptive resolution to different regions of the image according to their importance. Specifically, at an intermediate layer of the ViT, we select anchors from the token sequence using the proposed spatial-aware density-based clustering algorithm. Tokens adjacent to anchors are merged to form low-resolution regions, while the remaining tokens are preserved independently at high resolution.
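The anchor-selection-and-merging step described above can be illustrated with a minimal sketch. The code below follows a DPC-kNN-style density-peak clustering over a metric that mixes feature distance with spatial distance; the function name, the `beta` spatial weight, and the average-merging of clusters are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def select_anchors_and_merge(tokens, positions, num_anchors, k=5, beta=1.0):
    """Hypothetical sketch of spatial-aware density-based anchor selection.

    tokens:    (N, D) token features from an intermediate ViT layer
    positions: (N, 2) 2-D grid coordinates of each token
    Returns merged features (num_anchors, D) and the anchor index (N,)
    assigned to each token.
    """
    n = tokens.shape[0]
    # Pairwise distance mixing feature similarity with spatial proximity;
    # beta (an assumed hyperparameter) weights the spatial term.
    feat_d = np.linalg.norm(tokens[:, None] - tokens[None], axis=-1)
    spat_d = np.linalg.norm(positions[:, None] - positions[None], axis=-1)
    dist = feat_d + beta * spat_d

    # Local density: tokens close to their k nearest neighbors (in the
    # mixed metric) get high density, in the style of DPC-kNN.
    knn = np.sort(dist, axis=1)[:, 1:k + 1]
    density = np.exp(-knn.mean(axis=1))

    # Distance to the nearest higher-density token; the densest token
    # takes the maximum distance so it is always eligible as an anchor.
    delta = np.empty(n)
    for i in range(n):
        higher = density > density[i]
        delta[i] = dist[i, higher].min() if higher.any() else dist[i].max()

    # Anchors = tokens scoring highest on density * separation.
    anchors = np.argsort(density * delta)[-num_anchors:]

    # Assign every token to its nearest anchor; average-merge each cluster
    # into one low-resolution token.
    assign = anchors[np.argmin(dist[:, anchors], axis=1)]
    merged = np.stack([tokens[assign == a].mean(axis=0) for a in anchors])
    return merged, assign
```

In this sketch, tokens merged into a shared anchor play the role of low-resolution regions; a full pipeline would additionally keep non-merged tokens at high resolution and restore the original layout before the decoder.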
Appendix for AiluRus: A Scalable ViT Framework for Dense Prediction
We deploy AiluRus on object detection tasks; the results are presented in Tab. To analyze clustering behavior, we report the assignment statistics in Fig. A-1a, where AiluRus is deployed on Segmenter ViT-L and clustering is performed on the output of the second layer, with further analysis shown in Fig. A-1b. Despite its ability to accelerate various dense prediction tasks, AiluRus has some limitations, e.g., it does not accelerate the FPN and the complicated decoder.